This is the companion notebook to the introductory course on simple portfolio construction and performance metrics. It contains simple examples of portfolio backtesting. The structures of this section will be used later on in the course.

\(\rightarrow\) ABOUT THE NOTEBOOK:

  • the code chunks are sequential. They must be executed in the correct order.
  • text between stars appears in bold in html format.

1 A glimpse at the S&P500

The S&P500 is one of the most widely scrutinised index in the US Equity investment space. It serves very often as benchmark. The purpose of this section is to unveil a few of its statistical properties and to use the highcharter package for highchart rendering in R.

1.1 Prices

We start by loading the packages and downloading the data.

if(!require(highcharter)){install.packages("highcharter")} # Package for nice financial graphs
There were 22 warnings (use warnings() to see them)
library(tidyverse)
library(lubridate)
library(quantmod)    # The package that eases the downloading of financial data
library(highcharter) # The package for financial time-series plots
min_date <- "1995-01-01"
max_date <- "2019-02-12"
prices <- getSymbols("^GSPC", src = 'yahoo',  # Yahoo source (strange ticker)
             from = min_date, 
             to = max_date,
             auto.assign = TRUE, 
             warnings = FALSE) %>% 
  map(~Ad(get(.))) %>% 
  reduce(merge) %>%
  `colnames<-`("SP500")

Then, we turn to plotting, using highchart format. We underline that this package works with special xts (R extensible time-series) format. We point to the package reference for more details on this subject.

hc <- highchart(type = "stock") %>%
    hc_title(text = "Evolution of the S&P500") %>%
    hc_add_series(prices)
htmltools::tagList(hc)              # Used for html export

export_hc(hc, filename = "hc.js")   # Export

A nice feature of highcharts is that they allow the user to change the observation period and see the values of points on the curve.

1.2 Returns

Next, we turn to the distribution of returns.

returns <- (prices/lag(prices) - 1) %>% # Formula for returns
    data.frame(Date = index(.))     %>% # Adding the Date as a column
    na.omit()                           # Removing NA rows
m <- mean(returns$SP500)                # Average daily return
s <- sd(returns$SP500)                  # Volatility = sd of daily returns
returns %>% ggplot() +                  # Plot
    geom_histogram(aes(x = SP500, y = ..density..), bins = 100) +
    stat_function(fun = dnorm, args = list(mean = m, sd = s), aes(color = "Gaussian"))

We plot the Gaussian distribution with parameters corresponding to the sample mean and standard deviation. Small grey rectangles around \(\pm 0.05\) indicate that large positive and negative returns occur more often than estimate by the Gaussian law: the tails of their distribution are notoriously heavy.

1.3 Volatility

Finally, we take a dynamic look at the volatility. We take a frugal approach; much more elegant methods are presented in Section 4.5 of Reproducible Finance by Jonathan Regenstein.

nb_days <- 63   # 63 days roughly equivalent to 3 months 
vol <- 0        # Initialisation
for(i in 1:(nrow(returns) - nb_days + 1)){          # Loop on dates: not elegant!
  vol[i] <- sd(returns$SP500[i:(i + nb_days - 1)])  # Vol computed on rolling window of nb_days
}
Date <- returns$Date[nb_days:nrow(returns)]
vol <- data.frame(vol*sqrt(252))
rownames(vol) <- Date
hc_vol <- highchart(type = "stock") %>%
    hc_title(text = "Evolution of volatility") %>%
    hc_add_series(as.xts(vol))
htmltools::tagList(hc_vol)              # Used for html export

Clearly the graph shows periods of low volatility and clusters of high market turbulence (crashes, most of the time). Returns are therefore not stationary. The properties we exhibited for the S&P500 are also true at the individual stock level.

In the next section, we turn to the core topic of portfolio backtesting.

2 A universe of functions

Backtesting portfolio strategies is usually simple at first, but becomes more intricate when many options are considered. To keep things simple, it is more efficient to work with functions. Functions help compartmentalise the different tasks of the process. We will build functions that compute portfolio weights and others that evaluate the performance metrics of the strategies. While we will usually work with a loop on backtesting dates, it is possible to build functions that directly generate portfolio returns (see below: the map() function).

But first, we start with the preprocessing of the data.

2.1 Data preparation

We first import and arrange the data. The data consists of monhtly financial information pertaining to 30 large US firms. They are characterised by their ticker symbol:

A - F G - M O - Z
AAPL (Apple) GE (General Electric) ORCL (Oracle)
BA (Boeing) HD (Home Depot) PFE (Pfizer)
BAC (Bank of America) IBM PG (Procter & Gamble)
C (Citigroup) INTC (Intel) T (AT&T)
CSCO (Cisco) JNJ (Johnson & Johnson) UNH (United Health)
CVS (CVS Health) JPM (JP Morgan) UPS
CVX (Chevron) K (Kellogg) VZ (Verizon)
D (Dominion Energy) MCK (McKesson) WFC (Wells Fargo)
DIS (Disney) MRK (Merck) WMT (Walmart)
F (Ford) MSFT (Microsoft) XOM (Exxon)

There are 7 attributes: closing price (Close), market capitalisation in M$ (Mkt_Cap), price-to-book ratio (P2B), 1 month volatility (Vol_1M), 1 month relative strength index (RSI_1M), debt-to-equity ratio (D2E) and profitability margin (Prof_Marg).
Finally, the time range is 2000-2018.

load("data.RData")                      # Loading the data: IF DIRECTORY OK!
data <- data %>% arrange(Date,Tick)     # Ranked first according to date and then stocks
summary(data)                           # Descriptive statistics
      Tick           Date                Close            Mkt_Cap            P2B           
 AAPL   : 221   Min.   :2000-01-03   Min.   :  1.004   Min.   :  4491   Min.   :   0.0922  
 BA     : 221   1st Qu.:2004-08-02   1st Qu.: 28.050   1st Qu.: 63498   1st Qu.:   2.0783  
 BAC    : 221   Median :2009-03-02   Median : 42.535   Median :128864   Median :   3.1054  
 C      : 221   Mean   :2009-03-02   Mean   : 57.519   Mean   :145819   Mean   :   6.6966  
 CSCO   : 221   3rd Qu.:2013-10-01   3rd Qu.: 66.045   3rd Qu.:198487   3rd Qu.:   5.2977  
 CVS    : 221   Max.   :2018-05-01   Max.   :557.000   Max.   :887952   Max.   :1259.1554  
 (Other):5304                                                                              
     Vol_1M            RSI_1M           D2E             Prof_Marg      
 Min.   :  5.276   Min.   :22.04   Min.   :    0.00   Min.   :-106.26  
 1st Qu.: 15.778   1st Qu.:46.12   1st Qu.:   29.36   1st Qu.:   5.11  
 Median : 21.607   Median :51.60   Median :   60.55   Median :  10.13  
 Mean   : 26.898   Mean   :51.64   Mean   :  191.92   Mean   :  11.70  
 3rd Qu.: 31.826   3rd Qu.:57.10   3rd Qu.:  205.02   3rd Qu.:  18.84  
 Max.   :265.429   Max.   :84.84   Max.   :14585.61   Max.   : 145.58  
                                                                       

This simple table shows a possible outlier for the P2B variable. The maximum value is clearly out of range.
Next, we format the data for future use. Notably, we compute returns.

data <- data  %>% 
    group_by(Tick) %>%                          # Grouping: returns computed stock-by-stock
    mutate(Return = Close / lag(Close) - 1) %>% # Adding returns
    na.omit()                                   # Take out missing values
returns <- data %>%                             # Take data
    select(Tick, Date, Return) %>%              # Select 3 columns
    spread(key = Tick, value = Return)          # Put them into 'matrix' format
returns                                         # Show the returns

By definition, a portfolio is a choice of weights that sum to one. Below, we implement two classical weighting schemes: the uniform portfolio (EW = equal weights) and the maximum Sharpe ratio portfolio (MSR). The latter is more complicated and requires inputs: namely the column vector of (expected) mean \(\mu\) and covariance matrix \(\Sigma\) of the assets. For simplicity, we will estimate them using sample moments - even though this is known to be a bad choice. This requires an additional argument in the function: the assets’ past returns. The MSR weights are \(w=\frac{\Sigma^{-1}\mu}{1'\Sigma^{-1}\mu}\). Both \(\mu\) and 1 are vectors here.

weights_msr <- function(returns){ # returns will refer to PAST returns
    m <- apply(returns, 2, mean)  # Vector of average returns
    sigma <- cov(returns)         # Covariance matrix
    w <- solve(sigma) %*% m       # Raw weights
    return(w / sum(w))            # Returns normalised weights
}
weights_ew <- function(returns){  # We keep the same syntax for simplicity
    N <- length(returns[1,])      # Number of assets
    return(rep(1/N,N))            # Weight = 1/N
}

We are now ready to proceed with the initialisation of the variables that will use in the main loop.

tick <- levels(data$Tick)               # Set of assets
t_all <- unique(data$Date)              # Set of dates
sep_date <- as.Date("2010-01-01")       # This date separates in-sample vs out-of-sample
t_oos <- t_all[t_all > sep_date]        # Out-of-sample dates (i.e., testing set)
portf_weights <- matrix(0, nrow = length(t_oos), # Will store portfolio weights
                           ncol = length(tick)) 
portf_returns <- c()                    # Will store portfolio returns

2.2 A first example

At last, we can proceed to the main loop. A note of caution: both in the weighting scheme functions and in the loop below, we assume well defined (finite, i.e., non NA) data. Obviously, feeding NA data in the system will produce NA outputs. There are only four steps in the loop:

  1. extract the data
  2. compute portfolio weights
  3. compute realised returns
  4. derive the return of the portfolio
for(t in 1:length(t_oos)){
    temp_data <- returns %>% filter(Date < t_oos[t]) # 1) Extracting past data: expanding window! 
    portf_weights[t,] <- temp_data %>%               # 2) Take this past data 
        select(-Date) %>%                            # Take out the date column
        weights_msr()                                # Appply the weighting scheme function
    realised_returns <- returns %>%                  # 3) Take returns
        filter(Date ==  t_oos[t]) %>%                # Keep only current date
        select(-Date)                                # Take out the date column
    portf_returns[t] <- sum(portf_weights[t,] * realised_returns) # 4) Compute the return
}

In the above backtest, the amount of data considered to form the portfolio decision increases with time (expanding window). It is easy to fix the number of data point at each step and proceed on rolling windows (exercise). Likewise, switching from MSR to EW is immediate (just change the weighting function).

The main output is the vector of portfolio returns. We can easily plot the evolution of the portfolio through time.

port <- data.frame(t_oos, cumprod(1+portf_returns))         # Portf. values via cumulative product
colnames(port) <- c("Date", "Portfolio")                    # Changing column names
port %>% ggplot() + geom_line(aes(x = Date, y = Portfolio)) # Plot

Below, we create a function that takes a series of returns as input and provides a few simple performance metrics.

perf_met <- function(returns){
    avg_ret <- mean(returns, na.rm = T)                     # Arithmetic mean 
    vol <- sd(returns, na.rm = T)                           # Volatility
    Sharpe_ratio <- avg_ret / vol                           # Sharpe ratio
    VaR_5 <- quantile(returns, 0.05)                        # Value-at-risk
    met <- data.frame(avg_ret, vol, Sharpe_ratio, VaR_5)    # Aggregation of all of this
    rownames(met) <- "metrics"
    return(met)
}
perf_met(portf_returns) # Let's test on the actual returns

Note: the values are not annualised. The annualisation can be directly coded into the function. A heuristic way to proceed is to multiply average returns by 12 and volatilities by \(\sqrt{12}\) - though this omits the compounding effect. These simplifications are ok if they are used to compare strategies. The Value at Risk is nontheless dependent on the return horizon and cannot be proxied so simply.

An important indicator is the turnover of the portfolio: it assesses asset rotation and thus impacts transaction costs. Simple turnover computes average absolute variation in portfolio weights: \(\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-1}^n|\). Full turnover takes into account the variation of the weights between rebalancing dates: \(\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-}^n|\), where \(t-\) is the time just before the rebalancing date.

turnover_simple <- function(weights){
    turn <- 0
    for(t in 2:length(t_oos)){
          turn <- turn + sum(abs(weights[t,] - weights[t-1,])) # Variation in weights
    }
    return(turn/(length(t_oos)-1))  # BEWARE: monthly value!
}
turnover_simple(portf_weights)      # Simple turnover!
[1] 0.3035889

Below, we switch to the full definition of the turnover. In this case, the variation in weights is adjusted because the weight prior to rebalancing have drifted.

turnover_full <- function(weights, asset_returns, t_oos){
    turn <- 0
    for(t in 2:length(t_oos)){
        realised_returns <- returns %>% filter(Date == t_oos[t]) %>% select(-Date)
        prior_weights <- weights[t-1,] * (1 + realised_returns) # Before rebalancing
        turn <- turn + apply(abs(weights[t,] - prior_weights/sum(prior_weights)),1,sum)
    }
    return(turn/(length(t_oos)-1))
}
asset_returns <- returns %>% filter(Date > sep_date)    # Asset returns over the experiment
turnover_full(portf_weights, asset_returns, t_oos)      # Real turnover
[1] 0.3989153

Note that the turnover is computed at the monthly frequency. The second value is always more realistic; in this case it is substantially higher compared to the simplified proxy. For the sake of compactness, the turnover should be included in the perf_met() function.

3 Extensions

3.1 Comparing strategies

An important generalisation of the above framework is to consider more than one strategy. Comparisons are commonplace in the asset management industry: obviously, people look for the best strategy (according to particular goals, beliefs, and preferences). Below, we show how this can be handled. We will compare the two strategies that we mentionned above. First, we need to re-initiate the variables because their dimension will change (NOTE: an alternative route would be to work with lists).

Tt <- length(t_oos)                                             # Nb of computation dates 
# Avoid T because T = TRUE!
nb_port <- 2                                                    # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick)))   # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port)           # Store returns

Second, we embed all weighting schemes into one single function.

weights_multi <- function(returns,j){   # Strategies are indexed by j
    if(j == 1){ # j = 1 => MSR
        return(weights_msr(returns))
    }
    if(j == 2){ # j = 2 => EW
        N <- length(returns[1,])
        return(rep(1/N,N))
    }
}

We decided to recode the EW strategy, but we could have used the weights_ew() function instead. Finally, the main loop is only marginally different from the single strategy loop.

for(t in 1:length(t_oos)){
    temp_data <- returns %>% 
        filter(Date < t_oos[t]) %>% 
        select(-Date)
    for(j in 1:nb_port){                           # This is the novelty: we loop on the strategies 
        portf_weights[t,j,] <- weights_multi(temp_data, j)  # The weights' function is indexed by j
        realised_returns <- returns %>% 
            filter(Date ==  t_oos[t]) %>% 
            select(-Date)
        portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns)
    }
}
apply(portf_returns,2,perf_met) %>%                     # Taking perf metrics
    unlist() %>%                                        # Flattening the list
    matrix(nrow = 2, byrow = T) %>%                     # Ordering them
    `colnames<-`(c("avg_ret", "vol", "SR", "VaR")) %>%  # Adding column names
    data.frame()                                        # Converting to dataframe
apply(portf_weights, 2, turnover_simple) %>% unlist()
[1] 0.3035889 0.0000000

We recall the order of strategies: MSR first (line) and EW second (line).

Again, turnover should (and will) be added to the performance metric function. Since uniform weights are constant, their simplified turnover is zero. In practice, that is not the case because weights evolve according to asset returns. The adjustment would imply only a small turnover.

There is a well-documented substantial difference between the two weigting schemes in terms of asset rotation. The MSR that we compute is clearly not competitive: even before transaction costs, its Sharpe ratio is smaller than that of the EW portfolio (see DeMiguel et al. (2009) for further evidence on the robustness of the EW portfolio).

3.2 Make do without loops

Finally, we show how to bypass the loop over dates. While this is not useful on small datasets, it can save time on large databases because loops are notoriously time-consuming. Below, we show how to proceed with the map() function in the simple case with only one strategy.

# We create a function that will compute returns for each date:
port_map <- function(t_oos, returns){             
    temp_data <- returns %>% filter(Date < t_oos) # Still expanding window...
    portf_weights <- temp_data %>% 
        select(-Date) %>% 
        weights_msr()
    realised_returns <- returns %>% 
        filter(Date ==  t_oos) %>% 
        select(-Date)
    return(sum(portf_weights * realised_returns))
}
# the map() function does it all!
portf_returns <- t_oos %>%                      # The variable over which we loop
    map(~port_map(.x, returns = returns)) %>%   # Is sent to the map() function
    unlist()                                    # The output is flattened
perf_met(portf_returns)                         # Compute the perf metrics

4 Characteristics-based choice

In this section, we start the chapter of strategies based on firm characteristics (features). As an illustration, we check a well-documented (though still controversial) anomaly: the size effect. We build two portfolios: in the first (resp. second) one, we invest in the firms that have a below (resp. above) median market capitalisation. The first portfolio will be a ‘small’ portfolio and the second one a ‘large’ one. Stocks are equally-weighted inside the portfolios.

First, we prepare the variables and define the weight function.

nb_port <- 2                                                      # Nb of portfolios
portf_weights <- array(0, dim = c(Tt-1, nb_port, length(tick)))   # Where we store weights
portf_returns <- matrix(0, nrow = Tt-1, ncol = nb_port)           # Where we store returns
weights_cap <- function(data,j){    
# More general than before: we feed all the data, not just returns
    m <- median(data$Mkt_Cap)       # Compute the median market cap
    n <- nrow(data)                 # Compute the number of assets
    if(j == 1){return((data$Mkt_Cap < m)/n*2)} # Small cap
    if(j == 2){return((data$Mkt_Cap > m)/n*2)} # Large cap
}

There will only be Tt-1 dates since we lose one because of the computation of future returns.

Second, we can launch the backtesting loop.

for(t in 2:length(t_oos)){
    temp_data <- data %>% filter(Date == t_oos[t-1])        # We keep the data of the previous date
    for(j in 1:nb_port){                                    # We loop on the strategies 
        portf_weights[t-1,j,] <- weights_cap(temp_data, j)  # The weights' function is indexed by j
        realised_returns <- returns %>% 
            filter(Date ==  t_oos[t]) %>% 
            select(-Date)
        portf_returns[t-1,j] <- sum(portf_weights[t-1,j,] * realised_returns)
    }
}

Third, we proceed to performance metrics.

apply(portf_returns,2,perf_met) %>%                     # Taking perf metrics
    unlist() %>%                                        # Flattening the list
    matrix(nrow = 2, byrow = T) %>%                     # Ordering them
    `colnames<-`(c("avg_ret", "vol", "SR", "VaR")) %>%  # Adding column names
    data.frame()                                        # Converting to dataframe

Small firms do indeed generate a higher level of performance! In order to reach this conclusion in a rigourous fashion, we would need to perform the same analysis on at least 1,000 stocks (ideally, more) and on 5 to 10 portfolio sorts (from very small firms to very large ones). Below, we check the weights of the portfolio on one particular date.

small <- portf_weights[2,1,] # t = 2, j = 1 (t_oos[2] = 2010-03-01, small firms)
large <- portf_weights[2,2,] # t = 2, j = 2 (t_oos[2] = 2010-03-01, large firms)
data.frame(small, large, row.names = tick)

Indeed, some stocks have zero weights and others 1/15.

Finally, let’s see how we could have coded those strategies using the tidyverse (and a lot of piping!).

data %>% filter(Date > sep_date) %>%            # Keep only the out-of-sample backtesting dates
    group_by(Tick) %>%                          # Group by stock
    mutate(F_Return = lead(Return)) %>%         # Compute forward (i.e., realised) return
    na.omit() %>%                               # Take out NAs
    group_by(Date) %>%                          # Group by dates
    mutate(Mkt_Cap_Binary = Mkt_Cap < median(Mkt_Cap)) %>%  # Compute median cap for each date
    group_by(Mkt_Cap_Binary) %>%                # Group by Mkt_Cap: small vs large
    summarise(avg_return = mean(F_Return))      # Simple pivot table

It can be useful to see how often stocks switch from one family to another (from below median to above median or vice-versa). Below, we show a plot of Mkt_Cap, conditional on Mkt_Cap being above the current median.

data %>% group_by(Date) %>%                         # Group by date
    mutate(Mkt_Cap_median = median(Mkt_Cap)) %>%    # Compute median cap for each date
    filter(Mkt_Cap > Mkt_Cap_median) %>%            # Keep only the large stocks
    ggplot(aes(x = Date, y = Mkt_Cap, color = Tick)) + geom_line() + ylim(75000,250000) +
        geom_line(aes(x = Date, y = Mkt_Cap_median), color = "black") 

# The black line shows the running median

The straight lines show the discontinuities: one stock being large at some point in time, then small and then large again. The straight lines show the periods when the stock was small. The black line shows the median capitalisation (in the sample). Finally, because we focus in the zone close to the median and impose an upper limit of 250B$, there are some missing points.

5 Exercises

5.1 Rolling window

Change the main loop so that only 60 points of data are given to the weighting scheme(s). Sixty points amount to 5 years of monthly data.

5.2 Other performance metrics

Using the PerformanceAnalytics package (install it first), compute the maximum drawdown via: https://www.rdocumentation.org/packages/PerformanceAnalytics/versions/1.5.2/topics/maxDrawdown

Compute the transaction cost-adjusted SR: \(TC-SR=(\bar{r}-0.005*Turn)/\sigma\).

Add both metrics to the perf_met() function.

5.3 Minimum variance

Add the MV portfolio to the set of strategies. The weights depend only on the covariance matrix: \(w=\frac{\Sigma^{-1}1}{1'\Sigma^{-1}1}\).

5.4 map() expertise

Extend the map() syntax to the case with many strategies.

5.5 Realistic portfolios

In practice, many saveguards are applied, if only to reduce turnover. One such example is box constraint: the weights in the portfolio must not lie above or below user-specified thresholds.

Add a winsorising feature in the weights function that allows to cut the weights accordingly. If possible, add arguments directly to the function. Don’t forget to normalise the weights ex-post!

---
title: "Portfolio basics"
output: 
  html_notebook:
    toc: true
    number_sections: true
    toc_float: 
      collapsed: false
---

This is the companion notebook to the introductory course on simple **portfolio construction** and **performance metrics**. It contains simple examples of **portfolio backtesting**. The structures of this section will be used later on in the course.

$\rightarrow$ **ABOUT THE NOTEBOOK**:   

- the code chunks are **sequential**. They must be executed in the correct order.  
- text between **stars** appears in bold in html format.  

# A glimpse at the S&P500
The **S&P500** is one of the most widely scrutinised index in the US Equity investment space. It serves very often as **benchmark**. The purpose of this section is to unveil a few of its statistical properties and to use the **highcharter** package for highchart rendering in R.

## Prices
We start by loading the packages and downloading the data.

```{r load, warning = FALSE, message = FALSE}
if(!require(highcharter)){install.packages("highcharter")} # Package for nice financial graphs
library(tidyverse)
library(lubridate)
library(quantmod)    # The package that eases the downloading of financial data
library(highcharter) # The package for financial time-series plots

min_date <- "1995-01-01"
max_date <- "2019-02-12"
prices <- getSymbols("^GSPC", src = 'yahoo',  # Yahoo source (strange ticker)
             from = min_date, 
             to = max_date,
             auto.assign = TRUE, 
             warnings = FALSE) %>% 
  map(~Ad(get(.))) %>% 
  reduce(merge) %>%
  `colnames<-`("SP500")
```


Then, we turn to plotting, using **highchart** format. We underline that this package works with special **xts** (R extensible time-series) format. We point to the package reference for more details on this subject.

```{r highchart}
hc <- highchart(type = "stock") %>%
    hc_title(text = "Evolution of the S&P500") %>%
    hc_add_series(prices)
htmltools::tagList(hc)              # Used for html export
export_hc(hc, filename = "hc.js")   # Export
```

<script type="text/javascript" src="hc.js"></script>

A nice feature of highcharts is that they allow the user to change the observation period and see the values of points on the curve. 

## Returns
Next, we turn to the distribution of returns.

```{r}
returns <- (prices/lag(prices) - 1) %>% # Formula for returns
    data.frame(Date = index(.))     %>% # Adding the Date as a column
    na.omit()                           # Removing NA rows
m <- mean(returns$SP500)                # Average daily return
s <- sd(returns$SP500)                  # Volatility = sd of daily returns

returns %>% ggplot() +                  # Plot
    geom_histogram(aes(x = SP500, y = ..density..), bins = 100) +
    stat_function(fun = dnorm, args = list(mean = m, sd = s), aes(color = "Gaussian"))
```

We plot the Gaussian distribution with parameters corresponding to the sample mean and standard deviation. Small grey rectangles around $\pm 0.05$ indicate that large positive and negative returns occur more often than estimate by the Gaussian law: the tails of their distribution are notoriously **heavy**.

## Volatility
Finally, we take a dynamic look at the volatility. We take a frugal approach; much more elegant methods are presented in Section 4.5 of **Reproducible Finance** by Jonathan Regenstein.

```{r vol}
nb_days <- 63   # 63 days roughly equivalent to 3 months 
vol <- 0        # Initialisation
for(i in 1:(nrow(returns) - nb_days + 1)){          # Loop on dates: not elegant!
  vol[i] <- sd(returns$SP500[i:(i + nb_days - 1)])  # Vol computed on rolling window of nb_days
}
Date <- returns$Date[nb_days:nrow(returns)]
vol <- data.frame(vol*sqrt(252))

rownames(vol) <- Date
hc_vol <- highchart(type = "stock") %>%
    hc_title(text = "Evolution of volatility") %>%
    hc_add_series(as.xts(vol))
htmltools::tagList(hc_vol)              # Used for html export
```

Clearly the graph shows periods of low volatility and **clusters** of high market turbulence (crashes, most of the time). Returns are therefore **not stationary**. The properties we exhibited for the S&P500 are also true at the individual stock level.

In the next section, we turn to the core topic of **portfolio backtesting**.


# A universe of functions

Backtesting portfolio strategies is usually simple at first, but becomes more intricate when many options are considered. To keep things simple, it is more efficient to work with functions. **Functions help compartmentalise the different tasks of the process**. We will build functions that compute portfolio weights and others that evaluate the performance metrics of the strategies. While we will usually work with a loop on backtesting dates, it is possible to build functions that directly generate portfolio returns (see below: the map() function).

But first, we start with the preprocessing of the data.

## Data preparation
We first import and arrange the data. 
The data consists of monhtly financial information pertaining to 30 large US firms. They are characterised by their **ticker** symbol:  

|A - F| G - M |O - Z|
| --- | ----- | --- |
| AAPL (Apple)| GE (General Electric) | ORCL (Oracle)
| BA (Boeing) | HD (Home Depot) | PFE (Pfizer)
| BAC (Bank of America) | IBM | PG (Procter & Gamble)
| C (Citigroup) | INTC (Intel) | T (AT&T)
|CSCO (Cisco) | JNJ (Johnson & Johnson) |  UNH (United Health)
|CVS (CVS Health) | JPM (JP Morgan) | UPS 
|CVX (Chevron) | K (Kellogg) | VZ (Verizon)
|D (Dominion Energy) | MCK (McKesson) | WFC (Wells Fargo)
|DIS (Disney) | MRK (Merck) | WMT (Walmart)
|F (Ford) | MSFT (Microsoft) | XOM (Exxon)


There are 7 attributes: closing price (**Close**), market capitalisation in M$ (**Mkt_Cap**), price-to-book ratio (**P2B**), 1 month volatility (**Vol_1M**), 1 month relative strength index (**RSI_1M**), debt-to-equity ratio (**D2E**) and profitability margin (**Prof_Marg**).   
Finally, the time range is 2000-2018.

```{r initiate, warning = FALSE, message = FALSE}
load("data.RData")                      # Loading the data: IF DIRECTORY OK!
data <- data %>% arrange(Date,Tick)     # Ranked first according to date and then stocks
summary(data)                           # Descriptive statistics
```

This simple table shows a possible outlier for the **P2B** variable. The maximum value is clearly out of range.  
Next, we format the data for future use. Notably, we compute returns.

```{r format, warning = FALSE}
data <- data  %>% 
    group_by(Tick) %>%                          # Grouping: returns computed stock-by-stock
    mutate(Return = Close / lag(Close) - 1) %>% # Adding returns
    na.omit()                                   # Take out missing values

returns <- data %>%                             # Take data
    select(Tick, Date, Return) %>%              # Select 3 columns
    spread(key = Tick, value = Return)          # Put them into 'matrix' format
returns                                         # Show the returns
```

By definition, a portfolio is a choice of weights that sum to one. Below, we implement two classical weighting schemes: the uniform portfolio (**EW** = equal weights) and the maximum Sharpe ratio portfolio (**MSR**). The latter is more complicated and requires inputs: namely the column vector of (expected) mean $\mu$ and covariance matrix $\Sigma$ of the assets. For simplicity, we will estimate them using **sample moments** - even though this is known to be a bad choice. This requires an additional argument in the function: the assets' past returns. The **MSR** weights are $w=\frac{\Sigma^{-1}\mu}{1'\Sigma^{-1}\mu}$. Both $\mu$ and 1 are vectors here.

```{r weights}
weights_msr <- function(returns){ # returns will refer to PAST returns
    m <- apply(returns, 2, mean)  # Vector of average returns
    sigma <- cov(returns)         # Covariance matrix
    w <- solve(sigma) %*% m       # Raw weights
    return(w / sum(w))            # Returns normalised weights
}

weights_ew <- function(returns){  # We keep the same syntax for simplicity
    N <- length(returns[1,])      # Number of assets
    return(rep(1/N,N))            # Weight = 1/N
}
```


We are now ready to proceed with the **initialisation** of the variables that will use in the main loop.

```{r init_2, warning = FALSE}
tick <- levels(data$Tick)               # Set of assets
t_all <- unique(data$Date)              # Set of dates
sep_date <- as.Date("2010-01-01")       # This date separates in-sample vs out-of-sample
t_oos <- t_all[t_all > sep_date]        # Out-of-sample dates (i.e., testing set)
portf_weights <- matrix(0, nrow = length(t_oos), # Will store portfolio weights
                           ncol = length(tick)) 
portf_returns <- c()                    # Will store portfolio returns
```

## A first example
At last, we can proceed to the main loop. A note of caution: both in the weighting scheme functions and in the loop below, we assume well defined (finite, i.e., non NA) data. Obviously, feeding **NA** data in the system will produce **NA** outputs. There are only four steps in the loop:  

1. extract the data  
2. compute portfolio weights  
3. compute realised returns  
4. derive the return of the portfolio

```{r main}
for(t in 1:length(t_oos)){
    temp_data <- returns %>% filter(Date < t_oos[t]) # 1) Extracting past data: expanding window! 
    portf_weights[t,] <- temp_data %>%               # 2) Take this past data 
        select(-Date) %>%                            # Take out the date column
        weights_msr()                                # Appply the weighting scheme function
    realised_returns <- returns %>%                  # 3) Take returns
        filter(Date ==  t_oos[t]) %>%                # Keep only current date
        select(-Date)                                # Take out the date column
    portf_returns[t] <- sum(portf_weights[t,] * realised_returns) # 4) Compute the return
}
```

In the above **backtest**, the amount of data considered to form the portfolio decision increases with time (**expanding window**). It is easy to fix the number of data point at each step and proceed on rolling windows (exercise). Likewise, switching from **MSR** to **EW** is immediate (just change the weighting function).  

The main output is the vector of portfolio returns. We can easily plot the **evolution** of the portfolio through time.

```{r port_plot, warning = FALSE}
port <- data.frame(t_oos, cumprod(1+portf_returns))         # Portf. values via cumulative product
colnames(port) <- c("Date", "Portfolio")                    # Changing column names
port %>% ggplot() + geom_line(aes(x = Date, y = Portfolio)) # Plot
```


Below, we create a function that takes a series of returns as input and provides a few simple **performance metrics**.

```{r perf_met, warning = FALSE, message = FALSE}
perf_met <- function(returns){
    avg_ret <- mean(returns, na.rm = T)                     # Arithmetic mean 
    vol <- sd(returns, na.rm = T)                           # Volatility
    Sharpe_ratio <- avg_ret / vol                           # Sharpe ratio
    VaR_5 <- quantile(returns, 0.05)                        # Value-at-risk
    met <- data.frame(avg_ret, vol, Sharpe_ratio, VaR_5)    # Aggregation of all of this
    rownames(met) <- "metrics"
    return(met)
}
perf_met(portf_returns) # Let's test on the actual returns
```

Note: the values are not annualised. The annualisation can be directly coded into the function. A heuristic way to proceed is to multiply average returns by 12 and volatilities by $\sqrt{12}$ - though this omits the compounding effect. These simplifications are ok if they are used to compare strategies. The Value at Risk is nontheless dependent on the return horizon and cannot be proxied so simply.   

An important indicator is the **turnover** of the portfolio: it assesses asset rotation and thus impacts transaction costs. Simple turnover computes average absolute variation in portfolio weights: $\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-1}^n|$. Full turnover takes into account the variation of the weights between rebalancing dates: $\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-}^n|$, where $t-$ is the time just before the rebalancing date. 

```{r turnover, warning = FALSE, message = FALSE}
turnover_simple <- function(weights){
    turn <- 0
    for(t in 2:length(t_oos)){
          turn <- turn + sum(abs(weights[t,] - weights[t-1,])) # Variation in weights
    }
    return(turn/(length(t_oos)-1))  # BEWARE: monthly value!
}
turnover_simple(portf_weights)      # Simple turnover!
```

Below, we switch to the full definition of the **turnover**. In this case, the variation in weights is adjusted because the weight prior to rebalancing have drifted.

```{r turnover_full, warning = FALSE, message = FALSE}
turnover_full <- function(weights, asset_returns, t_oos){
    turn <- 0
    for(t in 2:length(t_oos)){
        realised_returns <- returns %>% filter(Date == t_oos[t]) %>% select(-Date)
        prior_weights <- weights[t-1,] * (1 + realised_returns) # Before rebalancing
        turn <- turn + apply(abs(weights[t,] - prior_weights/sum(prior_weights)),1,sum)
    }
    return(turn/(length(t_oos)-1))
}
asset_returns <- returns %>% filter(Date > sep_date)    # Asset returns over the experiment
turnover_full(portf_weights, asset_returns, t_oos)      # Real turnover
```

Note that the turnover is computed at the monthly frequency. The second value is always more realistic; in this case it is substantially higher compared to the simplified proxy. For the sake of compactness, the turnover should be included in the **perf_met**() function.


# Extensions
## Comparing strategies

An important generalisation of the above framework is to consider more than one strategy. Comparisons are commonplace in the asset management industry: obviously, people look for the *best* strategy (according to particular goals, beliefs, and preferences). Below, we show how this can be handled. We will compare the two strategies that we mentionned above.
First, we need to re-initiate the variables because their dimension will change (**NOTE**: an alternative route would be to work with lists). 

```{r multi_init, warning = FALSE, message = FALSE}
Tt <- length(t_oos)                                             # Nb of computation dates 
# Avoid T because T = TRUE!
nb_port <- 2                                                    # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick)))   # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port)           # Store returns
```

Second, we embed all weighting schemes into one single function.

```{r multi_weights}
weights_multi <- function(returns,j){   # Strategies are indexed by j
    if(j == 1){ # j = 1 => MSR
        return(weights_msr(returns))
    }
    if(j == 2){ # j = 2 => EW
        N <- length(returns[1,])
        return(rep(1/N,N))
    }
}
```

We decided to recode the **EW** strategy, but we could have used the **weights_ew**() function instead.
Finally, the main loop is only marginally different from the single strategy loop.

```{r multi_loop, warning = FALSE, message = FALSE}
for(t in 1:length(t_oos)){
    temp_data <- returns %>% 
        filter(Date < t_oos[t]) %>% 
        select(-Date)
    for(j in 1:nb_port){                           # This is the novelty: we loop on the strategies 
        portf_weights[t,j,] <- weights_multi(temp_data, j)  # The weights' function is indexed by j
        realised_returns <- returns %>% 
            filter(Date ==  t_oos[t]) %>% 
            select(-Date)
        portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns)
    }
}
apply(portf_returns,2,perf_met) %>%                     # Taking perf metrics
    unlist() %>%                                        # Flattening the list
    matrix(nrow = 2, byrow = T) %>%                     # Ordering them
    `colnames<-`(c("avg_ret", "vol", "SR", "VaR")) %>%  # Adding column names
    data.frame()                                        # Converting to dataframe
apply(portf_weights, 2, turnover_simple) %>% unlist()
```

We recall the order of strategies: **MSR** first (line) and **EW** second (line).  

Again, turnover should (and will) be added to the performance metric function. Since uniform weights are constant, their simplified turnover is zero. In practice, that is not the case because weights evolve according to asset returns. The adjustment would imply only a small turnover.

There is a well-documented substantial difference between the two weigting schemes in terms of asset rotation. The MSR that we compute is clearly not competitive: even before transaction costs, its Sharpe ratio is smaller than that of the EW portfolio (see DeMiguel et al. (2009) for further evidence on the robustness of the EW portfolio).


## Make do without loops

Finally, we show how to **bypass the loop** over dates. While this is not useful on small datasets, it can save time on large databases because loops are notoriously time-consuming. Below, we show how to proceed with the **map**() function in the simple case with only one strategy.

```{r map, warning = FALSE, message = FALSE}
# We create a function that will compute returns for each date:
port_map <- function(t_oos, returns){             
    temp_data <- returns %>% filter(Date < t_oos) # Still expanding window...
    portf_weights <- temp_data %>% 
        select(-Date) %>% 
        weights_msr()
    realised_returns <- returns %>% 
        filter(Date ==  t_oos) %>% 
        select(-Date)
    return(sum(portf_weights * realised_returns))
}
# the map() function does it all!
portf_returns <- t_oos %>%                      # The variable over which we loop
    map(~port_map(.x, returns = returns)) %>%   # Is sent to the map() function
    unlist()                                    # The output is flattened
perf_met(portf_returns)                         # Compute the perf metrics
```


# Characteristics-based choice
In this section, we start the chapter of strategies based on firm characteristics (**features**). As an illustration, we check a well-documented (though still controversial) **anomaly**: the **size effect**. We build two portfolios: in the first (*resp*. second) one, we invest in the firms that have a below (*resp*. above) median market capitalisation. The first portfolio will be a 'small' portfolio and the second one a 'large' one. Stocks are equally-weighted inside the portfolios.

First, we prepare the variables and define the **weight function**.

```{r chars_1, warnings = FALSE, message = FALSE}
nb_port <- 2                                                      # Nb of portfolios
portf_weights <- array(0, dim = c(Tt-1, nb_port, length(tick)))   # Where we store weights
portf_returns <- matrix(0, nrow = Tt-1, ncol = nb_port)           # Where we store returns

weights_cap <- function(data,j){    
# More general than before: we feed all the data, not just returns
    m <- median(data$Mkt_Cap)       # Compute the median market cap
    n <- nrow(data)                 # Compute the number of assets
    if(j == 1){return((data$Mkt_Cap < m)/n*2)} # Small cap
    if(j == 2){return((data$Mkt_Cap > m)/n*2)} # Large cap
}
```

There will only be Tt-1 dates since we lose one because of the computation of future returns.  

Second, we can launch the **backtesting loop**.

```{r chars_2, warnings = FALSE, message = FALSE}
for(t in 2:length(t_oos)){
    temp_data <- data %>% filter(Date == t_oos[t-1])        # We keep the data of the previous date
    for(j in 1:nb_port){                                    # We loop on the strategies 
        portf_weights[t-1,j,] <- weights_cap(temp_data, j)  # The weights' function is indexed by j
        realised_returns <- returns %>% 
            filter(Date ==  t_oos[t]) %>% 
            select(-Date)
        portf_returns[t-1,j] <- sum(portf_weights[t-1,j,] * realised_returns)
    }
}
```

Third, we proceed to performance metrics.

```{r chars_3, warnings = FALSE, message = FALSE}
apply(portf_returns,2,perf_met) %>%                     # Taking perf metrics
    unlist() %>%                                        # Flattening the list
    matrix(nrow = 2, byrow = T) %>%                     # Ordering them
    `colnames<-`(c("avg_ret", "vol", "SR", "VaR")) %>%  # Adding column names
    data.frame()                                        # Converting to dataframe
```

Small firms do indeed generate a higher level of performance! In order to reach this conclusion in a rigourous fashion, we would need to perform the same analysis on at least 1,000 stocks (ideally, more) and on 5 to 10 portfolio sorts (from very small firms to very large ones). Below, we check the weights of the portfolio on one particular date.

```{r check, warning = FALSE, message = FALSE}
small <- portf_weights[2,1,] # t = 2, j = 1 (t_oos[2] = 2010-03-01, small firms)
large <- portf_weights[2,2,] # t = 2, j = 2 (t_oos[2] = 2010-03-01, large firms)
data.frame(small, large, row.names = tick)
```

Indeed, some stocks have zero weights and others 1/15.  

Finally, let's see how we could have coded those strategies using the **tidyverse** (and a lot of piping!).

```{r pivot!, warning = FALSE, message = FALSE}
data %>% filter(Date > sep_date) %>%            # Keep only the out-of-sample backtesting dates
    group_by(Tick) %>%                          # Group by stock
    mutate(F_Return = lead(Return)) %>%         # Compute forward (i.e., realised) return
    na.omit() %>%                               # Take out NAs
    group_by(Date) %>%                          # Group by dates
    mutate(Mkt_Cap_Binary = Mkt_Cap < median(Mkt_Cap)) %>%  # Compute median cap for each date
    group_by(Mkt_Cap_Binary) %>%                # Group by Mkt_Cap: small vs large
    summarise(avg_return = mean(F_Return))      # Simple pivot table
```


It can be useful to see how often stocks switch from one family to another (from below median to above median or vice-versa). Below, we show a plot of **Mkt_Cap**, conditional on **Mkt_Cap** being above the current median. 

```{r cap_graph, warning = FALSE, message = FALSE}
data %>% group_by(Date) %>%                         # Group by date
    mutate(Mkt_Cap_median = median(Mkt_Cap)) %>%    # Compute median cap for each date
    filter(Mkt_Cap > Mkt_Cap_median) %>%            # Keep only the large stocks
    ggplot(aes(x = Date, y = Mkt_Cap, color = Tick)) + geom_line() + ylim(75000,250000) +
        geom_line(aes(x = Date, y = Mkt_Cap_median), color = "black") 
# The black line shows the running median
```

The straight lines show the **discontinuities**: one stock being large at some point in time, then small and then large again. The straight lines show the periods when the stock was small. The black line shows the median capitalisation (in the sample). Finally, because we focus in the zone close to the median and impose an upper limit of 250B$, there are some missing points.


# Exercises

## Rolling window

Change the main loop so that only 60 points of data are given to the weighting scheme(s). Sixty points amount to 5 years of monthly data.


```{r your turn!}

```


## Other performance metrics
Using the PerformanceAnalytics package (install it first), compute the maximum drawdown via:
https://www.rdocumentation.org/packages/PerformanceAnalytics/versions/1.5.2/topics/maxDrawdown

Compute the transaction cost-adjusted SR: $TC-SR=(\bar{r}-0.005*Turn)/\sigma$.

Add both metrics to the perf_met() function.

## Minimum variance
Add the MV portfolio to the set of strategies. The weights depend only on the covariance matrix: $w=\frac{\Sigma^{-1}1}{1'\Sigma^{-1}1}$.

## map() expertise
Extend the map() syntax to the case with many strategies.

## Realistic portfolios
In practice, many saveguards are applied, if only to reduce turnover. One such example is box constraint: the weights in the portfolio must not lie above or below user-specified thresholds. 

Add a winsorising feature in the weights function that allows to cut the weights accordingly. If possible, add arguments directly to the function. Don't forget to normalise the weights ex-post!

